<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article# " lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>getting-sample-data-from-external-sources | 绿萝间</title>
<link href="../assets/css/all-nocdn.css" rel="stylesheet" type="text/css">
<link href="../assets/css/ipython.min.css" rel="stylesheet" type="text/css">
<link href="../assets/css/nikola_ipython.css" rel="stylesheet" type="text/css">
<meta name="theme-color" content="#5670d4">
<meta name="generator" content="Nikola (getnikola.com)">
<link rel="alternate" type="application/rss+xml" title="RSS" href="../rss.xml">
<link rel="canonical" href="https://muxuezi.github.io/posts/getting-sample-data-from-external-sources.html">
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
    tex2jax: {
        inlineMath: [ ['$','$'], ["\\(","\\)"] ],
        displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
        processEscapes: true
    },
    displayAlign: 'center', // Change this to 'center' to center equations.
    "HTML-CSS": {
        styles: {'.MathJax_Display': {"margin": 0}}
    }
});
</script><!--[if lt IE 9]><script src="../assets/js/html5.js"></script><![endif]--><meta name="author" content="Tao Junjie">
<link rel="prev" href="defining-the-gaussian-process-object-directly.html" title="defining-the-gaussian-process-object-directly" type="text/html">
<link rel="next" href="imputing-missing-values-through-various-strategies.html" title="imputing-missing-values-through-various-strategies" type="text/html">
<meta property="og:site_name" content="绿萝间">
<meta property="og:title" content="getting-sample-data-from-external-sources">
<meta property="og:url" content="https://muxuezi.github.io/posts/getting-sample-data-from-external-sources.html">
<meta property="og:description" content="从外部源获取样本数据¶








如果条件允许，学本书内容时尽量用你熟悉的数据集；方便起见，我们用scikit-learn的内置数据库。这些内置数据库可用于测试不同的建模技术，如回归和分类。而且这些内置数据库都是非常著名的数据库。这对不同领域的学术论文的作者们来说是很用的，他们可以用这些内置数据库将他们的模型与其他模型进行比较。
推荐使用IPython来运行文中的指令。大内存很重要，这样可以">
<meta property="og:type" content="article">
<meta property="article:published_time" content="2015-07-27T14:58:10+08:00">
<meta property="article:tag" content="CHS">
<meta property="article:tag" content="ipython">
<meta property="article:tag" content="Machine Learning">
<meta property="article:tag" content="Python">
<meta property="article:tag" content="scikit-learn cookbook">
</head>
<body>
<a href="#content" class="sr-only sr-only-focusable">Skip to main content</a>

<!-- Menubar -->

<nav class="navbar navbar-inverse navbar-static-top"><div class="container">
<!-- This keeps the margins nice -->
        <div class="navbar-header">
            <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#bs-navbar" aria-controls="bs-navbar" aria-expanded="false">
            <span class="sr-only">Toggle navigation</span>
            <span class="icon-bar"></span>
            <span class="icon-bar"></span>
            <span class="icon-bar"></span>
            </button>
            <a class="navbar-brand" href="https://muxuezi.github.io/">

                <span id="blog-title">绿萝间</span>
            </a>
        </div>
<!-- /.navbar-header -->
        <div class="collapse navbar-collapse" id="bs-navbar" aria-expanded="false">
            <ul class="nav navbar-nav">
<li>
<a href="../archive.html">Archive</a>
                </li>
<li>
<a href="../categories/">Tags</a>
                </li>
<li>
<a href="../rss.xml">RSS feed</a>

                
            </li>
</ul>
<ul class="nav navbar-nav navbar-right">
<li>
    <a href="getting-sample-data-from-external-sources.ipynb" id="sourcelink">Source</a>
    </li>

                
            </ul>
</div>
<!-- /.navbar-collapse -->
    </div>
<!-- /.container -->
</nav><!-- End of Menubar --><div class="container" id="content" role="main">
    <div class="body-content">
        <!--Body content-->
        <div class="row">
            
            
<article class="post-text h-entry hentry postpage" itemscope="itemscope" itemtype="http://schema.org/Article"><header><h1 class="p-name entry-title" itemprop="headline name"><a href="#" class="u-url">getting-sample-data-from-external-sources</a></h1>

        <div class="metadata">
            <p class="byline author vcard"><span class="byline-name fn">
                    Tao Junjie
            </span></p>
            <p class="dateline"><a href="#" rel="bookmark"><time class="published dt-published" datetime="2015-07-27T14:58:10+08:00" itemprop="datePublished" title="2015-07-27 14:58">2015-07-27 14:58</time></a></p>
            
        <p class="sourceline"><a href="getting-sample-data-from-external-sources.ipynb" id="sourcelink">Source</a></p>

        </div>
        

    </header><div class="e-content entry-content" itemprop="articleBody text">
    <div tabindex="-1" id="notebook" class="border-box-sizing">
    <div class="container" id="notebook-container">

<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="从外部源获取样本数据">从外部源获取样本数据<a class="anchor-link" href="getting-sample-data-from-external-sources.html#%E4%BB%8E%E5%A4%96%E9%83%A8%E6%BA%90%E8%8E%B7%E5%8F%96%E6%A0%B7%E6%9C%AC%E6%95%B0%E6%8D%AE">¶</a>
</h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>如果条件允许，学本书内容时尽量用你熟悉的数据集；方便起见，我们用scikit-learn的内置数据库。这些内置数据库可用于测试不同的建模技术，如回归和分类。而且这些内置数据库都是非常著名的数据库。这对不同领域的学术论文的作者们来说是很用的，他们可以用这些内置数据库将他们的模型与其他模型进行比较。</p>
<blockquote>
<p>推荐使用IPython来运行文中的指令。大内存很重要，这样可以让普通的命令正常运行。如果用IPython Notebook就更好了。如果你用Notebook，记得用<code>%matplotlib inline</code>指令，这样图象就会出现在Notebook里面，而不是一个新窗口里。</p>
</blockquote>
<!-- TEASER_END -->
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Getting-ready">Getting ready<a class="anchor-link" href="getting-sample-data-from-external-sources.html#Getting-ready">¶</a>
</h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>scikit-learn的内置数据库在<code>datasets</code>模块里。用如下命令导入：</p>

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [1]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">sklearn</span> <span class="k">import</span> <span class="n">datasets</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</pre></div>

</div>
</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>在IPython里面运行<code>datasets.*?</code>就会看到<code>datasets</code>模块的指令列表。</p>

</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="How-to-do-it…">How to do it…<a class="anchor-link" href="getting-sample-data-from-external-sources.html#How-to-do-it%E2%80%A6">¶</a>
</h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><code>datasets</code>模块主要有两种数据类型。较小的测试数据集在<code>sklearn</code>包里面，可以通过<code>datasets.load_*?</code>查看。较大的数据集可以根据需要下载。后者默认情况下不在<code>sklearn</code>包里面；但是，有时这些大数据集可以更好的测试模型和算法，因为比较复杂足以模拟现实情形。</p>
<p>默认在<code>sklearn</code>包里面的数据集可以通过<code>datasets.load_*?</code>查看。另外一些数据集需要通过<code>datasets.fetch_*?</code>下载，这些数据集更大，没有被自动安装。经常用于测试那些解决实际问题的算法。</p>
<p>首先，加载<code>boston</code>数据集看看：</p>

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [3]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">boston</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">load_boston</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">boston</span><span class="o">.</span><span class="n">DESCR</span><span class="p">)</span>
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics &amp; Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh &amp; Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh &amp; Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><code>DESCR</code>将列出数据集的一些概况。下面我们来下载一个数据集：</p>

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [4]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">housing</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">fetch_california_housing</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">housing</span><span class="o">.</span><span class="n">DESCR</span><span class="p">)</span>
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>downloading Cal. housing from http://lib.stat.cmu.edu/modules.php?op=modload&amp;name=Downloads&amp;file=index&amp;req=getit&amp;lid=83 to C:\Users\tj2\scikit_learn_data
California housing dataset.

The original database is available from StatLib

    http://lib.stat.cmu.edu/

The data contains 20,640 observations on 9 variables.

This dataset contains the average house value as target variable
and the following input variables (features): average income,
housing average age, average rooms, average bedrooms, population,
average occupation, latitude, and longitude in that order.

References
----------

Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297.


</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="How-it-works…">How it works…<a class="anchor-link" href="getting-sample-data-from-external-sources.html#How-it-works%E2%80%A6">¶</a>
</h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>当这些数据集被加载时，它们并不是直接转换成Numpy数组。它们是<code>Bunch</code>类型。<strong>Bunch</strong>是Python常用的数据结构。基本可以看成是一个词典，它的键被实例对象作为属性使用。</p>
<p>用<code>data</code>属性连接数据中包含自变量的Numpy数组，用<code>target</code>属性连接数据中的因变量。</p>

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [5]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">boston</span><span class="o">.</span><span class="n">data</span><span class="p">,</span> <span class="n">boston</span><span class="o">.</span><span class="n">target</span>
</pre></div>

</div>
</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>网络上<code>Bunch</code>对象有不同的实现；自己写一个也不难。scikit-learn用<a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/base.py">基本模块</a>定义<code>Bunch</code>。</p>

</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="There's-more…">There's more…<a class="anchor-link" href="getting-sample-data-from-external-sources.html#There's-more%E2%80%A6">¶</a>
</h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>让你从外部源获取数据集时，它默认会被当前文件夹的<code>scikit_learn_data/</code>放在里面，可以通过两种方式进行配置：</p>
<ul>
<li>设置<code>SCIKIT_LEARN_DATA</code>环境变量指定下载位置</li>
<li>
<code>fetch_*?</code>方法的第一个参数是<code>data_home</code>，可以知道下载位置</li>
</ul>
<p>通过<code>datasets.get_data_home()</code>很容易检查默认下载位置。</p>

</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="See-also">See also<a class="anchor-link" href="getting-sample-data-from-external-sources.html#See-also">¶</a>
</h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>UCI机器学习库（UCI Machine Learning Repository）是找简单数据集的好地方。很多scikit-learn的数据集都在那里，那里还有更多的数据集。其他数据源还是著名的KDD和Kaggle。</p>

</div>
</div>
</div>
    </div>
  </div>

    </div>
    <aside class="postpromonav"><nav><ul itemprop="keywords" class="tags">
<li><a class="tag p-category" href="../categories/chs.html" rel="tag">CHS</a></li>
            <li><a class="tag p-category" href="../categories/ipython.html" rel="tag">ipython</a></li>
            <li><a class="tag p-category" href="../categories/machine-learning.html" rel="tag">Machine Learning</a></li>
            <li><a class="tag p-category" href="../categories/python.html" rel="tag">Python</a></li>
            <li><a class="tag p-category" href="../categories/scikit-learn-cookbook.html" rel="tag">scikit-learn cookbook</a></li>
        </ul>
<ul class="pager hidden-print">
<li class="previous">
                <a href="defining-the-gaussian-process-object-directly.html" rel="prev" title="defining-the-gaussian-process-object-directly">Previous post</a>
            </li>
            <li class="next">
                <a href="imputing-missing-values-through-various-strategies.html" rel="next" title="imputing-missing-values-through-various-strategies">Next post</a>
            </li>
        </ul></nav></aside><script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"> </script><script type="text/x-mathjax-config">
MathJax.Hub.Config({
    tex2jax: {
        inlineMath: [ ['$','$'], ["\\(","\\)"] ],
        displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
        processEscapes: true
    },
    displayAlign: 'center', // Change this to 'center' to center equations.
    "HTML-CSS": {
        styles: {'.MathJax_Display': {"margin": 0}}
    }
});
</script></article>
</div>
        <!--End of body content-->

        <footer id="footer">
            Contents © 2017         <a href="mailto:muxuezi@gmail.com">Tao Junjie</a> - Powered by         <a href="https://getnikola.com" rel="nofollow">Nikola</a>         
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0">
<img alt="Creative Commons License BY-NC-SA" style="border-width:0; margin-bottom:12px;" src="http://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png"></a>
            
        </footer>
</div>
</div>


            <script src="../assets/js/all-nocdn.js"></script><script>$('a.image-reference:not(.islink) img:not(.islink)').parent().colorbox({rel:"gal",maxWidth:"100%",maxHeight:"100%",scalePhotos:true});</script><!-- fancy dates --><script>
    moment.locale("en");
    fancydates(0, "YYYY-MM-DD HH:mm");
    </script><!-- end fancy dates --><script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-51330059-1', 'auto');
  ga('send', 'pageview');

</script>
</body>
</html>
